Laplacian Score





Marginal Laplacian Score

Hay, Guy, Volk, Ohad

arXiv.org Machine Learning

High-dimensional imbalanced data poses a machine learning challenge. In the absence of sufficient or high-quality labels, unsupervised feature selection methods are crucial for the success of subsequent algorithms. There is therefore a growing need for unsupervised feature selection algorithms focused on imbalanced data. We propose the Marginal Laplacian Score (MLS), a modification of the well-known Laplacian Score (LS) better suited to imbalanced data. We introduce the assumption that minority-class or anomalous samples appear more frequently in the margins of the features. Consequently, MLS aims to preserve the local structure of the data set's margin. Since MLS is better suited for handling imbalanced data, we propose its integration into modern feature selection methods that utilize the Laplacian score. We integrate the MLS algorithm into Differentiable Unsupervised Feature Selection (DUFS), resulting in DUFS-MLS. The proposed methods demonstrate robust and improved performance on synthetic and public data sets.


Cluster Exploration using Informative Manifold Projections

Gerolymatos, Stavros, Evangelopoulos, Xenophon, Gusev, Vladimir, Goulermas, John Y.

arXiv.org Artificial Intelligence

Data exploration focuses on identifying informative patterns to discover new insight and knowledge about a collection of data. The often high-dimensional nature of such data renders the visual exploration process intractable for the human eye, and therefore specialized manipulation of the original samples is essential in practice. Dimensionality reduction methods have been at the forefront of this challenge Bishop [2006], aiming to recover lower-dimensional embeddings of the original data that facilitate the identification of underlying data cohorts and help us better understand the problem at hand. Perhaps the most well-known dimensionality reduction approach is principal component analysis (PCA) Hotelling [1933], an efficient linear method that aims to maximize the variance along the projection vectors, which in practice often proves insufficient for meaningful separation of cohorts. A variety of non-linear methods have also been proposed that conversely focus on locally preserving the structure of the data, such as Isomap Tenenbaum et al. [2000], LLE Roweis and Saul [2001], t-SNE van der Maaten and Hinton [2008], UMAP McInnes and Healy [2018], TriMap Amid and Warmuth [2019], and LargeVis Tang et al. [2016]. Projection pursuit (PP) Friedman and Tukey [1974], Caussinus and Ruiz-Gazen [2010] defines a family of dimensionality reduction methods that can enable various embedding effects depending on a suitably selected criterion. The kurtosis index Chiang et al. [2001] is one specific PP example that specializes in identifying "interesting" projections. Its minimization particularly penalizes normality of the data distribution, thus promoting more meaningful separability when searching for clusters. The above approaches nevertheless share the same attribute of offering a single static projection that does not consider any prior knowledge a practitioner may have regarding the high-dimensional latent structure.
Such projections can be uninformative as they tend to illustrate the most evident features which are often already known by the reader.
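The kurtosis index mentioned above can be made concrete with a small sketch: project the data onto a candidate direction, standardize, and measure the fourth moment. A Gaussian projection gives a value near 3, while a bimodal (well-separated clusters) projection gives a lower value, which is why minimizing this index favors "interesting" directions. This is an illustrative sketch of the idea, not the cited paper's exact estimator; the function name and interface are hypothetical.

```python
import numpy as np

def kurtosis_index(X, w):
    """Kurtosis of the data projected onto direction w.

    Illustrative 1-D projection pursuit index (hypothetical helper, not the
    estimator from Chiang et al. [2001]): values near 3 indicate a roughly
    Gaussian projection; lower values indicate bimodal/clustered projections.
    """
    y = X @ (w / np.linalg.norm(w))      # 1-D projection
    y = (y - y.mean()) / y.std()         # standardize
    return (y ** 4).mean()               # 4th moment; = 3 for a Gaussian
```

For two well-separated clusters along the first axis, the separating direction yields a kurtosis well below 3, while an uninformative noise direction stays close to 3, so minimizing the index steers the projection toward the cluster structure.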


Revised Conditional t-SNE: Looking Beyond the Nearest Neighbors

Heiter, Edith, Kang, Bo, Seurinck, Ruth, Lijffijt, Jefrey

arXiv.org Artificial Intelligence

Conditional t-SNE (ct-SNE) is a recent extension to t-SNE that allows removal of known cluster information from the embedding, to obtain a visualization revealing structure beyond label information. This is useful, for example, when one wants to factor out unwanted differences between a set of classes. We show that ct-SNE fails in many realistic settings, namely when the data is well clustered over the labels in the original high-dimensional space. We introduce a revised method by conditioning the high-dimensional similarities instead of the low-dimensional similarities and storing within- and across-label nearest neighbors separately. This also enables the use of recently proposed speedups for t-SNE, improving scalability. From experiments on synthetic data, we find that our proposed method resolves the considered problems and improves the embedding quality. On real data containing batch effects, the expected improvement does not always materialize. We argue that revised ct-SNE is preferable overall, given its improved scalability. The results also highlight new open questions, such as how to handle distance variations between clusters.


Laplacian Score for Feature Selection

Neural Information Processing Systems

In supervised learning scenarios, feature selection has been studied widely in the literature. Selecting features in unsupervised learning scenarios is a much harder problem, due to the absence of class labels that would guide the search for relevant information. Moreover, almost all previous unsupervised feature selection methods are "wrapper" techniques that require a learning algorithm to evaluate the candidate feature subsets. In this paper, we propose a "filter" method for feature selection which is independent of any learning algorithm. Our method can be performed in either a supervised or unsupervised fashion.
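The filter criterion described here, the Laplacian Score, can be sketched directly: build a nearest-neighbor graph with heat-kernel weights, form the graph Laplacian, and score each feature by how smoothly it varies over the graph (lower is better). This is a minimal sketch of the standard criterion, assuming a kNN graph and heat-kernel bandwidth `t` as free parameters; it is not the authors' reference implementation.

```python
import numpy as np

def laplacian_score(X, n_neighbors=5, t=1.0):
    """Laplacian Score per feature (lower = better locality preservation).

    Minimal sketch: X is (n_samples, n_features); `n_neighbors` and `t`
    are illustrative hyperparameters for the kNN heat-kernel graph.
    """
    n = X.shape[0]
    # Pairwise squared distances and heat-kernel similarities.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    S = np.exp(-d2 / t)
    # Keep each point's k nearest neighbors (excluding itself), symmetrized.
    idx = np.argsort(d2, axis=1)[:, 1:n_neighbors + 1]
    mask = np.zeros_like(S, dtype=bool)
    mask[np.repeat(np.arange(n), n_neighbors), idx.ravel()] = True
    mask |= mask.T
    S = np.where(mask, S, 0.0)

    d = S.sum(axis=1)                # degrees
    L = np.diag(d) - S               # graph Laplacian

    scores = np.empty(X.shape[1])
    for r in range(X.shape[1]):
        f = X[:, r]
        f_t = f - (f @ d) / d.sum()  # degree-weighted centering
        denom = f_t @ (d * f_t)
        scores[r] = (f_t @ L @ f_t) / denom if denom > 0 else np.inf
    return scores
```

On data with two clusters separated along one feature and pure noise in another, the cluster-revealing feature receives a much lower score, which is what makes the criterion usable as a label-free filter.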


Multi-modal Differentiable Unsupervised Feature Selection

Yang, Junchen, Lindenbaum, Ofir, Kluger, Yuval, Jaffe, Ariel

arXiv.org Artificial Intelligence

Multi-modal high throughput biological data presents a great scientific opportunity and a significant computational challenge. In multi-modal measurements, every sample is observed simultaneously by two or more sets of sensors. In such settings, many observed variables in both modalities are often nuisance and do not carry information about the phenomenon of interest. Here, we propose a multi-modal unsupervised feature selection framework: identifying informative variables based on coupled high-dimensional measurements. Our method is designed to identify features associated with two types of latent low-dimensional structures: (i) shared structures that govern the observations in both modalities and (ii) differential structures that appear in only one modality. To that end, we propose two Laplacian-based scoring operators. We incorporate the scores with differentiable gates that mask nuisance features and enhance the accuracy of the structure captured by the graph Laplacian. The performance of the new scheme is illustrated using synthetic and real datasets, including an extended biological application to single-cell multi-omics.


Deep Unsupervised Feature Selection by Discarding Nuisance and Correlated Features

Shaham, Uri, Lindenbaum, Ofir, Svirsky, Jonathan, Kluger, Yuval

arXiv.org Machine Learning

Modern datasets often contain large subsets of correlated features and nuisance features, which are unrelated or only loosely related to the main underlying structures of the data. Nuisance features can be identified using the Laplacian score criterion, which evaluates the importance of a given feature via its consistency with the graph Laplacian's leading eigenvectors. We demonstrate that in the presence of large numbers of nuisance features, the Laplacian must be computed on the subset of selected features rather than on the complete feature set. To do this, we propose a fully differentiable approach for unsupervised feature selection, utilizing the Laplacian score criterion to avoid the selection of nuisance features. We employ an autoencoder architecture to cope with correlated features, trained to reconstruct the data from the subset of selected features. We build on the recently proposed concrete layer, which allows controlling the number of selected features via architectural design, simplifying the optimization process. Experimenting on several real-world datasets, we demonstrate that our proposed approach outperforms similar approaches designed to avoid only correlated or nuisance features, but not both. Several state-of-the-art clustering results are reported.


Differentiable Unsupervised Feature Selection based on a Gated Laplacian

Lindenbaum, Ofir, Shaham, Uri, Svirsky, Jonathan, Peterfreund, Erez, Kluger, Yuval

arXiv.org Machine Learning

Scientific observations may consist of a large number of variables (features). Identifying a subset of meaningful features is often ignored in unsupervised learning, despite its potential for unraveling clear patterns hidden in the ambient space. In this paper, we present a method for unsupervised feature selection, and we demonstrate its use for the task of clustering. We propose a differentiable loss function that combines the Laplacian score, which favors low-frequency features, with a gating mechanism for feature selection. We improve the Laplacian score by replacing it with a gated variant computed on a subset of features. This subset is obtained using a continuous approximation of Bernoulli variables whose parameters are trained to gate the full feature space. We mathematically motivate the proposed approach and demonstrate that in the high noise regime, it is crucial to compute the Laplacian on the gated inputs, rather than on the full feature set. Experimental demonstration of the efficacy of the proposed approach and its advantage over current baselines is provided using several real-world examples.
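The key mechanic in this abstract, computing the Laplacian on gated inputs rather than on the full feature set, can be sketched as a single stochastic forward evaluation. The sketch below is an illustrative assumption about the setup, not the authors' implementation: gates are a clipped Gaussian relaxation of Bernoulli variables (parameterized by a trainable `mu` per feature), the graph is rebuilt from the gated features, and the per-feature Laplacian scores are summed. A real training loop would differentiate this quantity (plus a sparsity term on the gates) with respect to `mu`.

```python
import numpy as np

def gated_laplacian_score(X, mu, sigma=0.5, t=1.0, rng=None):
    """One stochastic evaluation of a gated Laplacian score (sketch).

    Hypothetical names and relaxation: z is a clipped-Gaussian surrogate
    for Bernoulli gates; the heat-kernel graph and Laplacian are computed
    on the *gated* input X * z, as the abstract argues is crucial.
    Returns the summed per-feature score and the sampled gates.
    """
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.normal(scale=sigma, size=mu.shape)
    z = np.clip(mu + eps + 0.5, 0.0, 1.0)   # relaxed Bernoulli gates in [0, 1]
    Xg = X * z                               # gated input

    # Heat-kernel similarity graph built from the gated features only.
    d2 = ((Xg[:, None, :] - Xg[None, :, :]) ** 2).sum(-1)
    S = np.exp(-d2 / t)
    np.fill_diagonal(S, 0.0)
    d = S.sum(axis=1)                        # degrees
    L = np.diag(d) - S                       # gated graph Laplacian

    total = 0.0
    for r in range(Xg.shape[1]):
        f = Xg[:, r]
        f = f - (f @ d) / d.sum()            # degree-weighted centering
        denom = f @ (d * f)
        if denom > 1e-12:
            total += (f @ L @ f) / denom
    return total, z
```

Because the gates multiply the input before the graph is built, noisy features that are gated toward zero stop corrupting the neighborhood structure, which is the high-noise-regime point the abstract makes.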